Moving the Bytes with Bao
Rüdiger Klaehn
n0.computer
2023-04-15
“Perfection is achieved, not when there is nothing more to add, but
when there is nothing left to take away.”
Antoine de Saint-Exupéry
Outline
- Iroh primitives
- What do we want?
- How does bao enable this?
- Applications
- Is this IPFS?
Iroh primitives
- Iroh v2 was started in January 2023
- Already used in production
- Goal is to be extremely simple
Blobs
- arbitrary size, from bytes to terabytes
- filestore pattern by default
- data itself remains under user control
Collections
- collections are blobs
- sequence of links (with some metadata)
- arbitrary size, billions of links possible
- links typically to blobs
- same hash algo (blake3) for all links
Verified streaming
- only send content for requested hash
- incremental verification
- validate on send
- validate on receive
Data transfer protocol
- request/response
- not a discovery protocol
- verified streaming
Request
- root hash
- ranges for each element
- collection first, at offset 0
- children in order of links
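The request shape described above can be sketched as a small data structure. The names here (`Request`, `RangeSpec`) and the field layout are illustrative only, not iroh's actual wire format:

```python
from dataclasses import dataclass

@dataclass
class RangeSpec:
    """Chunk ranges requested for one element, as (start, end) pairs."""
    ranges: list[tuple[int, int]]

@dataclass
class Request:
    root: bytes                # blake3 hash identifying the collection
    specs: list[RangeSpec]     # index 0: the collection itself (offset 0),
                               # index 1..n: children in link order

# Requesting the whole collection blob and only the first chunk of child 2:
req = Request(
    root=bytes(32),
    specs=[
        RangeSpec([(0, 2**64 - 1)]),  # the collection, all of it
        RangeSpec([]),                # child 1: nothing
        RangeSpec([(0, 1)]),          # child 2: first chunk only
    ],
)
```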
Response
- concatenated response for each non-empty item
- just the bytes?
- enter bao
What is blake3
- Cryptographic hash function
- Internal ephemeral merkle tree
- Binary tree
- 1024 byte chunks
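Because the chunk size is fixed at 1024 bytes, the shape of the merkle tree is fully determined by the file size; a small sketch (helper names are made up):

```python
CHUNK = 1024  # blake3 chunk size in bytes

def num_chunks(size: int) -> int:
    # a blob always has at least one (possibly short or empty) chunk
    return max(1, (size + CHUNK - 1) // CHUNK)

def depth(chunks: int) -> int:
    # depth of the binary tree over the chunks: ceil(log2(chunks))
    return max(0, (chunks - 1).bit_length())

# a 1 TiB file has 2**30 chunks under a tree of depth 30
assert num_chunks(2**40) == 2**30
assert depth(2**30) == 30
```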
What is Bao
- bao is blake3
- tree is persisted
- stored either
- in separate file (outboard)
- mixed with the data
- we use outboard
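Since the outboard stores one 64-byte hash pair per parent node, and a binary tree over n chunks has n - 1 parents, the outboard size follows directly from the file size; a sketch:

```python
CHUNK = 1024  # blake3 chunk size in bytes
PAIR = 64     # two 32-byte child hashes per parent node

def outboard_size(file_size: int) -> int:
    chunks = max(1, (file_size + CHUNK - 1) // CHUNK)
    return (chunks - 1) * PAIR  # one hash pair per parent node

# roughly 1/16 of the data: ~64 GiB of outboard for a 1 TiB file
assert outboard_size(2**40) == (2**30 - 1) * 64
```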
Provider
- get request
- traverse relevant part of tree
- in pre-order
- send hash pairs and chunks
Requester
- tree structure depends only on size
- read size
- use tree iterator to figure out what to expect
- validate every time a new piece is received
- detect corruption after 1 item (pair or chunk)
Summary
- Provider
- traverse tree
- read from data and outboard
- validate and write to socket
- Receiver
- traverse tree to know what to expect
- read from socket and validate
- use data (stream or store)
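Provider and receiver walk the same implicit tree in pre-order, which is what makes the stream predictable from the size alone. A Python sketch of that traversal, with toy event tuples standing in for real hash pairs and chunk bytes (as in blake3, the left subtree covers the largest power of two of chunks):

```python
def largest_pow2_lt(n: int) -> int:
    """Chunks in the left subtree: largest power of two below n."""
    assert n > 1
    p = 1
    while p * 2 < n:
        p *= 2
    return p

def encode(lo, hi, want_lo, want_hi, out):
    """Pre-order traversal of the implicit tree over chunks [lo, hi).
    Emits a hash pair for every inner node touched, then the chunks."""
    if hi <= want_lo or want_hi <= lo:
        return  # subtree does not overlap the requested range: skip it
    if hi - lo == 1:
        out.append(("chunk", lo))
        return
    out.append(("pair", lo, hi))  # parent node: 64 bytes of child hashes
    mid = lo + largest_pow2_lt(hi - lo)
    encode(lo, mid, want_lo, want_hi, out)
    encode(mid, hi, want_lo, want_hi, out)

# requesting chunk 5 of an 8-chunk blob touches log2(8) = 3 hash pairs
out = []
encode(0, 8, 5, 6, out)
assert out == [("pair", 0, 8), ("pair", 4, 8), ("pair", 4, 6), ("chunk", 5)]
```

The receiver runs the same recursion driven only by the announced size, so it knows whether the next item on the socket must be a pair or a chunk, and can reject the stream as soon as one item fails to verify.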
Overhead
- best case, large part or entire file
- 1/256 of data size
- 4 GiB for 1 TiB file
- worst case, 1 chunk in a 1 TiB file
- 64 * log2(chunks)
- 1920 bytes
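The worst-case figure is simple arithmetic: fetching a single chunk costs one 64-byte hash pair per tree level on the path from the root to that chunk:

```python
import math

CHUNK = 1024  # blake3 chunk size in bytes
PAIR = 64     # bytes per hash pair

chunks = 2**40 // CHUNK          # 2**30 chunks in a 1 TiB file
levels = int(math.log2(chunks))  # 30 levels from root to chunk
assert PAIR * levels == 1920     # 1920 bytes of overhead, as above
```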
Chunk groups
- outboards can get quite large
- 1/16 of data, 64 GiB for 1 TiB file
- idea: store only the higher levels of the tree
- outboard fits in mem even for giant files
- also reduces overhead when sending
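With a chunk group of g chunks, only the tree levels above the group level are kept, shrinking the outboard by a factor of g; a sketch (the group size of 16 below is an assumption for illustration, not necessarily iroh's default):

```python
CHUNK = 1024  # blake3 chunk size in bytes
PAIR = 64     # bytes per stored hash pair

def outboard_size(file_size: int, group_chunks: int = 1) -> int:
    """Outboard size when only tree levels above the chunk-group
    level are stored (group_chunks = 1 is the full outboard)."""
    chunks = max(1, -(-file_size // CHUNK))
    groups = -(-chunks // group_chunks)   # ceil division
    return max(0, groups - 1) * PAIR

tib = 2**40
assert outboard_size(tib) == (2**30 - 1) * 64      # full: ~64 GiB
assert outboard_size(tib, 16) == (2**26 - 1) * 64  # groups of 16: ~4 GiB
```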
Applications
- database sync
- ML weights
- git lfs
- game assets
- directory trees
- disk images
Future plans
- append only
- arbitrary writes
- sync live db images
- single writer mutable file system
Limitations
- deep and highly dynamic DAGs
- rabin chunking
- prolly trees
- you can put collections in collections, but
- currently no plan for dag aware gc
Linux kernel
stats for linux.car
- branch size: 4740977
- leaf size: 1293774012
Wikipedia math
stats for wiki_math.car
- branch size: 33407570
- leaf size: 3541099753
Lotus testnet
stats for lotus_testnet.car
- branch size: 414198
- leaf size: 23932
Content-addressing
- every byte is content-addressed
- incremental verification everywhere
- we don’t even trust the file system
- hence validate on send
Components
- libp2p -> quic
- UnixFS: no*
- we have file system support
- IPLD: no*
- Bitswap: no
- IPNS: future plan
- HTTP Gateway: user
- DHT: future plan
- (*) DAG gc support: monetized use case
Post-order outboard
- pre-order outboard
- moves nodes around when adding
- not good for append only
- post-order outboard
- nodes stay fixed when appending data
- good for appending
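The append-friendliness of the post-order layout can be checked on the implicit tree itself: when the tree grows, finished subtrees keep their post-order positions, while pre-order positions shift. A toy sketch over (lo, hi) chunk ranges:

```python
def largest_pow2_lt(n: int) -> int:
    """Chunks in the left subtree: largest power of two below n."""
    p = 1
    while p * 2 < n:
        p *= 2
    return p

def post_order(lo, hi):
    """Parent nodes of the tree over chunks [lo, hi), post-order."""
    if hi - lo <= 1:
        return []
    mid = lo + largest_pow2_lt(hi - lo)
    return post_order(lo, mid) + post_order(mid, hi) + [(lo, hi)]

def pre_order(lo, hi):
    """Same parent nodes, pre-order."""
    if hi - lo <= 1:
        return []
    mid = lo + largest_pow2_lt(hi - lo)
    return [(lo, hi)] + pre_order(lo, mid) + pre_order(mid, hi)

# growing 4 chunks to 8: post-order keeps the old nodes in place ...
assert post_order(0, 8)[:3] == post_order(0, 4)
# ... while pre-order shifts them, forcing outboard rewrites on append
assert pre_order(0, 8)[:3] != pre_order(0, 4)
```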
What if blake3 gets broken
- collections contain header
- use other hash fn
- construction of bao is independent of hash fn
- could do it with SHA2-256
- not our biggest concern atm
How would you support dags
- plugin fn that takes a collection and returns a bitmap
- bitmap says which links are to collections
- then do normal dag liveness like in ipfs-embed
Collections as links
- Collections currently mix metadata and links
- Could be just a sequence of links
- Metadata in first child
- Exact metadata format up to users
- We would provide reference impls for directory tree / file system
What about other hashes
- No-go: we rely on blake3 stream validation
- In trusted scenarios:
- mapping to blake3 hash
- then use that to download
- verify at the end
Discovery protocol
- basically like requests
- you get back info about what is there
- range spec seq